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Abstract — This  paper  explores  the  problem  of  efficiently  order¬ 
ing  interprocessor  communication  operations  in  both  statically 
and  dynamically-scheduled  multiprocessors  for  iterative  dataflow 
graphs  with  probabilistic  execution  times.  In  most  digital  sig¬ 
nal  processing  applications,  the  throughput  of  the  system  is 
significantly  affected  by  communication  costs.  We  explicitly 
model  these  costs  within  an  effective  graph-theoretic  analysis 
framework.  We  show  that  ordered  transaction  schedules  can 
significantly  outperform  both  self-timed  schedules  and  dynamic 
schedules  for  moderate  task  execution  time  variability.  As  the 
task  execution  time  variability  increases,  we  show  that  first  self- 
timed  and  then  dynamic  scheduling  policies  are  preferred.  We 
perform  an  extensive  experimental  comparison  on  both  real  and 
simulated  benchmarks  to  gauge  the  effect  of  synchronization  and 
communication  overhead  costs  on  these  crossover  points. 

I.  Introduction 

Interprocessor  communication  (IPC)  operations  are  respon¬ 
sible  for  significant  execution  time  and  power  consumption 
penalties  in  multiprocessor  embedded  systems.  This  paper 
compares  trade-offs  for  different  IPC  ordering  strategies  in 
both  statically-scheduled  and  dynamically-scheduled  multipro¬ 
cessors  for  iterative  dataflow  specifications.  We  target  lower- 
cost,  shared  memory  embedded  architectures  in  which  IPC  is 
assumed  to  take  place  through  shared  memory.  Such  simple 
communication  mechanisms  are  common  in  embedded  sys¬ 
tems  due  to  their  simplicity  and  low  cost. 

II.  Previous  Work 

High-level  exploration  of  interprocessor  communication  in 
multiprocessor  architectures  has  received  more  attention  as  the 
complexity  and  size  of  these  architectures  has  increased.  Ko- 
gel  [1]  and  Pasricha  [2],  for  example,  have  recently  presented 
tools  for  early  exploration  of  bus-based  on-chip  communica¬ 
tion  architectures.  They  show  that  such  exploration  early  in  the 
design  is  essential  for  developing  efficient  implementations. 

Lee  and  Ha  [3]  discuss  four  general  scheduling  strategies — 
fully-static  (FS),  self-timed  (ST),  static-assignment  (SA), 
and  fully-dynamic  (FD) — for  multiprocessors.  Multiprocessor 
scheduling  can  be  divided  into  three  steps — assigning  actors 
to  processors  ( processor  assignment),  ordering  the  actors  as¬ 
signed  to  each  processor  ( actor  ordering),  and  determining 
when  each  actor  should  commence  execution.  All  of  these 


tasks  can  either  be  performed  at  run-time  or  at  compile  time 
to  give  us  different  scheduling  strategies. 

In  the  FS  strategy,  all  three  scheduling  steps  are  carried  out 
at  compile  time,  including  the  determination  of  an  exact  firing 
time  for  each  actor.  The  FS  strategy  only  works  when  tight 
worst-case  execution  times  are  available,  and  forces  system 
performance  to  conform  to  the  available  worst-case  bounds. 
In  the  ST  strategy,  on  the  other  hand,  processor  assignment 
and  actor  ordering  are  performed  at  compile  time,  but  run¬ 
time  synchronization  is  used  to  determine  actor  firing  times — 
an  ST  schedule  executes  by  firing  each  actor  invocation  A 
as  soon  as  it  can  be  determined  via  synchronization  that  the 
actor  invocations  on  which  A  is  dependent  have  all  completed 
execution.  In  the  SA  strategy,  the  processor  assignment  is 
performed  at  compile  time,  but  the  ordering  of  actors  on  each 
processor  is  determined  at  run  time.  In  the  FD  strategy,  all 
three  steps  (processor  assignment,  actor  ordering,  and  firing 
times)  are  determined  at  run  time. 

The  ordered  transaction  (OT)  method  [9]  falls  between  the 
FS  and  ST  strategies.  It  is  similar  to  the  ST  method  but  also 
adds  the  constraint  that  a  linear  ordering  of  the  communication 
actors  be  determined  at  compile  time,  and  enforced  at  run¬ 
time.  The  linear  ordering  imposed  is  called  the  transaction 
order  of  the  associated  multiprocessor  implementation. 

Sriram  [5]  shows  that  optimal  transaction  orders  can  be  de¬ 
rived  in  polynomial  time  if  IPC  costs  are  negligible;  however, 
the  performance  of  the  self-timed  schedule  is  an  upper  bound 
on  the  performance  of  corresponding  ordered  transaction 
schedules  under  negligible  IPC  costs.  Conversely,  Khandelia 
and  Bhattacharyya  [7]  show  that  when  IPC  costs  are  not 
negligible,  the  problem  of  determining  an  optimal  transaction 
order  is  NP-hard,  but  at  the  same  time  the  performance  of 
a  self-timed  schedule  can  be  exceeded  significantly  by  a 
carefully-constructed  transaction  order. 

In  this  paper,  we  examine  the  performance  of  OT  and  ST 
as  a  function  of  the  variability  of  task  execution  times,  and 
compare  them  with  the  FD  strategy. 

III.  Experiments 

We  developed  a  software  simulator  of  the  execution  of  self- 
timed  and  dynamic  iterative  schedules.  The  simulated  system 
is  a  shared-memory  architecture,  where  synchronizations  are 
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performed  by  accessing  the  shared  memory  bus.  Data  tokens 
associated  with  the  IPC  constraints  are  transferred  via  the 
shared  memory.  When  a  processor  tries  to  gain  access  to 
the  bus  for  a  synchronization  or  interprocessor  data  com¬ 
munication,  the  system  permits  access  if  the  bus  is  not  in 
use,  otherwise  it  denies  access  and  the  processor  waits  for  a 
specified  back-off  time,  then  tries  again. 

The  synchronization  cost  for  OT  is  much  lower  than  the 
synchronization  costs  for  ST.  In  the  OT  strategy  a  shared  bus 
access  takes  no  more  than  a  single  read  or  write  cycle  on 
the  processor,  and  the  overall  cost  of  communicating  one  data 
sample  is  two  or  three  instruction  cycles  [9]. 

Our  simulator  for  ST  operation  implements  both  the  Un¬ 
bounded  Buffer  Synchronization  (UBS)  and  the  Bounded  Buffer 
Synchronization  (BBS)  protocols  [9].  In  the  BBS  protocol,  the 
protocol  requires  one  local  memory  increment  operation  (the 
local  write  pointer)  and  one  write  to  shared  memory  (store 
write  pointer)  occur  after  the  source  node  of  the  synchroniza¬ 
tion  edge  has  executed. 

We  assume  an  architecture  where  all  synchronization  and 
memory  accesses  occur  in  a  single  shared  memory.  We  define 
a  parameter  (3  to  be  the  ratio  of  the  synchronization  time  to 
the  IPC  time.  Since  we  are  considering  HSDF  graphs  with 
one  data  token  produced  per  IPC  operation,  we  have  (3  >  2 
for  BBS  (at  least  2  memory  accesses  for  synchronization  for 
every  data  memory  access)  and  (3  >  4  for  UBS. 

The  simulator  for  FD  assumes  a  centralized  scheduler  with 
separate  control  signals  to  each  processor.  The  scheduler  keeps 
track  of  ready  tasks  and  ready  processors.  A  task  is  ready  when 
all  its  predecessors  in  the  application  graph  have  completed.  A 
processor  is  ready  if  it  is  not  executing  a  task  or  IPC  operation. 
Tasks  are  prioritized  according  to  ready  time  (the  earliest  time 
at  which  all  predecessors  have  completed)  with  ties  broken  by 
static  level  [9]1.  The  scheduler  attempts  to  place  the  highest 
priority  task  on  the  lowest  number  ready  processor  whenever 
a  new  ready  task  is  detected. 

We  implemented  the  heuristic  transaction  partial  order 
(TPO)  algorithm  [7]  to  determine  the  OT  task  ordering.  This 
heuristic  simultaneously  takes  IPC  costs  and  the  serialization 
effects  of  transaction  ordering  into  account  when  determining 
the  transaction  order. 

A.  Task  Execution  Times 

For  many  DSP  applications,  it  is  possible  to  obtain  accurate 
statistics  on  task  execution  times.  Probabilities  for  events  such 
as  cache  misses,  pipeline  stalls,  and  conditional  branches  can 
be  obtained  by  using  sampling  techniques  or  simulation  of  the 
target  architecture  [10].  We  utilize  the  task  execution  model 
in  [11],  where  each  task  Vi  in  the  task  graph  G  =  (V.  E) 
is  associated  with  three  possible  execution  times  eo,  ei,  or 
e2  with  probabilities  p0,  p1,  and  p2  respectively.  Here,  eo  is 
the  task  execution  time  given  in  the  benchmark  specification, 
ei  =  2eo  and  e2  =  4eo-  We  define  a  single  parameter  p  for 
the  degree  of  randomness  of  the  task  execution  times,  where 

1  The  static  level  of  a  task  t  is  defined  as  the  largest  sum  of  the  expected 
execution  times  along  any  directed  path  from  t  to  an  endnode  of  the  graph, 
over  all  endnodes  of  the  graph. 


Po  =  (1  —  p),  p\  =  p{  1  —  p),  and  p2  =  p2.  Note  that  for  this 
probability  distribution,  p  =  0.5  corresponds  to  the  highest 
degree  of  randomness.  Under  these  assumptions,  we  note  that 
the  FS  strategy  is  not  practical  for  any  p  >  0,  no  matter  how 
small,  since  an  FS  architecture  must  operate  with  e2  for  all 
tasks  in  order  to  assure  correctness. 

B.  Benchmarks 

The  benchmark  application  graphs  used  were  fairly  compli¬ 
cated,  ranging  from  between  9-764  nodes,  and  the  numbers 
of  processors  involved  ranged  from  3  to  8.  We  examined 
a  combination  of  real  and  synthetic  benchmarks.  For  the 
synthetic  benchmarks,  we  used  the  TGFF  [12]  algorithm. 

The  examples  fftl,  fft2,  nr\djft3  result  from  three  representa¬ 
tive  schedules  for  Fast  Fourier  Transforms  based  on  examples 
given  in  [13];  karplO  is  a  music  synthesis  application  based 
on  the  Karplus  Strong  algorithm  in  10  voices;  the  video  coder 
is  taken  from  [14],  and  cddat  is  a  CD  to  digital  audio  tape 
converter. 

IV.  Results 

Experiments  were  carried  out  to  compare  the  OT,  ST, 
and  FD  methods,  and  to  measure  the  performance  of  the 
TPO  heuristic  in  finding  transaction  orders.  For  the  OT  and 
ST  methods,  the  benchmarks  were  scheduled  using  the  DLS 
algorithm  [18]. 

We  used  the  task  execution  model  from  Section  III-A,  and 
calculated  the  average  iteration  periods  over  10000  iterations. 

We  define  a  parameter  a  that  quantifies  the  IPC  overhead 
in  a  given  schedule.  It  is  calculated  from  the  ratio  of  the 
total  IPC  time  (synchronization  plus  data  communication)  to 
the  total  execution  time  spent  on  computational  tasks  over  all 
processors.  Thus,  a  is  a  function  of  p,  the  schedule,  and  the 
relative  speed  of  processor  to  memory.  We  note  that  VLSI  is 
trending  toward  higher  relative  processor-to-memory  speeds 
(higher  a )  as  gate  lengths  decrease. 

Figure  1  plots  Tqt,  7'st-  and  Tfd  versus  the  parameter  p 
that  governs  the  degree  of  randomness  of  the  task  execution 
times  for  a  synthetic  application  benchmark  generated  by 
TGFF.  It  can  be  seen  that  the  dynamic  scheduling  approach 
performs  significantly  worse  than  OT  and  ST  for  low  task 
execution  time  randomness  (low  p),  but  that  FD  is  much 
less  sensitive  to  p.  The  dynamic  scheduling  algorithm  is  able 
to  adapt  (through  different  task  orderings  and  assignments) 
to  changes  in  execution  times,  while  the  task  orderings  and 
assignments  are  fixed  for  ST  and  OT.  This  behavior  of  FD 
compared  to  OT  and  ST  was  observed  with  all  the  benchmarks. 

Figure  2  plots  Tot  and  Xst  versus  p  for  different  values  of 
/ 3 ,  the  ratio  of  synchronization-to-IPC  overhead  described  in 
Section  III.  The  calculations  for  (3  =  0  do  not  correspond  to 
any  synchronization  protocol  in  our  architectural  model,  but 
are  given  as  a  point  of  reference.  Values  of  (3  <  2  would  be 
possible  if  a  separate  (faster)  memory  were  allocated  to  the 
synchronization  variables. 

The  iteration  period  increases  with  p  since  the  average 
execution  time  increases  with  p.  For  many  DSP  applications 
p  <  0.1  is  a  reasonable  assumption.  For  example,  if  e2 
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tgff  /3  =  1 


Task  variability  parameter  p 


Fig.  1.  OT,  ST,  and  FD  scheduling  for  an  application  graph  generated  by 
TGFF. 


irr  a(p  =  0)  =  0.24 


Probability  p 


Fig.  2.  Iteration  periods  Tot  and  Tgx  for  the  irr  benchmark  versus  amount 
of  randomness  in  task  execution  times  (parameter  p). 


FFT1  (3  =  4 


FFT1  (3  =  2 


corresponds  to  a  cache  miss,  e-\  to  a  processor  pipeline  stall, 
and  eo  to  the  base  execution  time  for  a  task,  p  =  0.1 
corresponds  to  a  1%  cache  miss  probability,  a  9%  pipeline 
stall  probability,  and  a  90%  probability  for  the  base  execution 
time. 

From  Figure  2  we  see  that  Tgx  increases  more  slowly  as 
a  function  of  p  than  does  Tot-  This  is  because  the  self- 
timed  schedule  has  more  flexibility  than  the  OT  schedule  (the 
OT  schedule  imposes  a  pre-determined,  global  ordering  of 
all  the  1PC  while  the  ST  does  not)  and  thus  is  better  able 
to  adapt  to  changes  in  task  execution  times.  This  behavior 
was  observed  with  all  the  benchmarks.  We  also  see  that  it  is 
possible  for  OT  to  outperform  ST  for  (3  =  0,  but  only  for  small 
p.  Comparing  Figure  2(a)  and  Figure  2  (b),  we  see  that  the 
slopes  of  the  curves  decrease  as  a  increases.  This  is  because 
the  IPC  operations  are  not  random,  and  so  as  IPC  increases, 


a  smaller  fraction  of  the  overall  execution  time  comes  from 
random  tasks. 

We  also  observe  that  the  relative  improvement  of  OT  over 
ST  increases  as  a  increases.  Figures  3  and  4  plot  the  ratio 
Tst/Tot  versus  a  for  the  irr  and  fftl  benchmarks.  For  the  irr 
benchmark,  1st /Tot  >  1  for  all  (3  >  2  and  p  <  0.4.  For  the 
fftl  benchmark  with  (3  =  2  and  p  =  0.4,  Tst/Tot  >  0.96, 
and  Xst/Tot  >  1  elsewhere.  As  discussed  above,  p  =  0.4 
represents  a  high  degree  of  uncertainty  for  task  execution  times 
in  DSP  applications  (with  p  =  0.5  representing  the  highest 
possible  degree  of  randomness  in  the  probability  distribution). 

Table  I  compares  the  performance  (iteration  period)  of  the 
ST  and  the  OT  schedules.  In  all  cases,  we  observe  that  the 
OT  strategy  outperforms  the  ST  strategy  for  (3  >  2.  As  noted 
before,  (3=2  and  (3  =  4  represent  lower  bounds  for  the  BBS 
and  UBS  synchronization  protocols,  respectively.  The  results 
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1  P  =  o 

p  =  0.1 
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II 

*X 

<N 

II 

sx 

II 

sx 

tgff  avg. 

(*,  *) 

0.21 

1.057 

1.206 

1.342 

1.046 

1.190 

1.321 

fftl 

(28,32) 

0.04 

0.970 

1.024 

1.097 

0.997 

1.064 

1.088 

fftl 

(28,32) 

0.26 

0.858 

1.146 

1.483 

0.872 

1.135 

1.452 

fft2 

(28,32) 

0.02 

0.995 

1.035 

1.083 

1.042 

1.058 

1.118 

fft3 

(28,32) 

0.04 

1.020 

1.665 

2.147 

1.001 

1.632 

2.104 

karplO 

(21,29) 

0.19 

0.896 

1.372 

1.895 

0.896 

1.273 

1.687 

video  coder 

(9,9) 

0.735 

1.014 

1.117 

1.441 

0.933 

1.028 

1.326 

cddat 

(760, 764) 

0.375 

1.081 

1.271 

1.640 

1.006 

1.182 

1.525 

irr 

(41,69) 

0.72 

1.013 

1.405 

2.140 

1.001 

1.414 

2.130 

irr 

(41,69) 

0.24 

0.956 

1.246 

1.645 

0.934 

1.082 

1.470 

laplace 

(16,24) 

0.4 

0.857 

1.692 

2.406 

0.847 

1.603 

2.248 

laplace 

(16,24) 

0.14 

1.025 

1.316 

1.608 

0.979 

1.186 

1.376 

TABLE  I 

Tst/Tqt  FOR  ST  AND  OT  SCHEDULES. 


for  the  (synthetic)  TGFF  benchmark  are  an  average  over  50 
different  graphs  generated  by  the  TGFF  program  [12]. 

V.  Conclusions 

We  have  demonstrated  that  the  ordered  transaction 
method — which  is  superior  to  the  self-timed  method  in  its 
predictability,  and  its  total  elimination  of  synchronization 
overhead — can  significantly  outperform  self-timed  and  fully 
dynamic  implementations  for  low  task  variability,  even  though 
the  ordered  transaction  implementation  offers  less  run-time 
flexibility  due  to  a  fixed  ordering  of  communication  opera¬ 
tions.  When  synchronization  cost  is  taken  into  account,  the 
ordered  transaction  method  performs  significantly  better  than 
the  self-timed  and  fully  dynamic  methods. 

We  have  studied  the  relative  behavior  of  OT,  ST,  and  FD 
implementations  under  a  realistic  model  for  task  execution 
times.  The  OT  strategy  performs  better  relative  to  the  ST 
and  FD  strategies  for  lower  p  (degree  of  randomness  in 
task  execution  times),  higher  /?  (synchronization  costs)  ,  and 
higher  a  (IPC  overhead).  The  ranges  in  which  OT  favors  ST 
encompass  the  design  space  that  we  are  targeting — namely, 
low-cost  shared  bus  embedded  multiprocessor  DSP  systems. 

We  have  also  developed  a  detailed  simulator  to  measure 
the  performance  of  the  self-timed  schedule  under  different 
constraints. 

The  benefits  of  OT  can  be  expected  to  increase  with 
the  general  trend  in  VLSI  technology  for  increasing  proces¬ 
sor/memory  performance  disparity.  Some  of  this  benefit  may 
be  offset,  however,  by  another  trend,  which  is  for  decreas¬ 
ing  predictability  in  application  behavior  (and  thus  execution 
times)  due  to  the  use  of  more  and  more  sophisticated  and 
adaptive  types  of  algorithms.  The  evolution  of  the  MPEG 
standards  is  an  example  of  this.  Further  research  on  OT  meth¬ 
ods  to  efficiently  handle  such  lower  degrees  of  predictability 
is  therefore  an  interesting  and  important  direction  for  further 
study. 
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